perf(xid_correlation): ratchet allocs/event 12.4 -> 1.88 (hit NORTHSTAR, #417) by trilamsr · Pull Request #448 · TraceCoreAI/tracecore

trilamsr · 2026-06-02T01:49:30Z

Summary

Lowers BenchmarkXidCorrelationDetector allocs/event from 12.40 → 1.88 (a 6.6x reduction), hitting the NORTHSTAR ≤2 target. Ratchets the absolute-ceiling gate from 13 → 2 and refreshes the baseline.

Follows the recipe established by PR #438 (nccl_hang: 3.99 → 1.70) and extends it with a composite-buffer + unsafe.String aliasing pass, which is load-bearing here because xid_correlation emits one verdict per evicted pod (896 verdicts / 1024-event bench window) vs nccl_hang's ~64 verdicts — per-verdict allocation cost dominates allocs/event.

Root cause

Per-verdict the detector built five variable-length prose strings (Headline, Remediation, xid-evidence UID, pod-evidence Description, plus EvictedPod from displayPodName's ns + "/" + name concat). Four of those used fmt.Sprintf, each costing one interface-boxing allocation per integer/string argument plus one result-string allocation. At 896 verdicts × ~11 allocs/verdict + a growing verdicts slice + a default-sized indexXidsByNode map, the measured floor was 12699 allocs/op.

Approach

Composite-buffer carve (the load-bearing change). buildXidCorrelationVerdict now pre-computes the rendered length of every variable-length string, allocates ONE []byte of exactly that total size, appends each field into a known offset range, then carves immutable string headers out via unsafe.String(&buf[lo], hi-lo). The buffer is finalized before any string is carved, so the aliasing is sound by construction. Per-verdict allocation floor drops from 6 to 2 (the composite buffer + the 2-element EvidenceTrail backing array).
Pre-size join structures. indexXidsByNode map is sized to len(xids); the verdicts slice is sized to len(events).
strconv-based int rendering via stack-allocated [20]byte scratch arrays + strconv.AppendInt (no fmt.Sprintf, no strconv.Itoa's allocation).
Strings.Builder fallback for xidEvidenceUID and xidEvidenceDescription (shared with hbm_ecc.go).

The unsafe.String aliasing trick is documented inline; the invariant ("buffer is immutable post-carve") is enforced by construction in buildXidCorrelationVerdict (no further append after the carve block).

Before / after

Bench: go test -bench=BenchmarkXidCorrelationDetector -benchmem -count=5 -benchtime=500ms ./bench/detectors/ on M1 Max.

Metric	Before	After	Delta
allocs/op	12699	1928	−85%
B/op	907 KiB	597 KiB	−34%
allocs/event	12.40	1.88	hits NORTHSTAR (≤2)

Profile entries fixed (per pprof -sample_index=alloc_objects):

xidCorrelationHeadline fmt.Sprintf → composite-buffer carve
xidCorrelationRemediation fmt.Sprintf → composite-buffer carve
xidEvidenceUID fmt.Sprintf → composite-buffer carve
pod-event Description fmt.Sprintf → composite-buffer carve
displayPodName ns + "/" + name concat → folded into composite buffer
indexXidsByNode map → pre-sized to len(xids)
verdicts slice grow chain → pre-sized to len(events)

Residual per-verdict allocations (the two unavoidable ones at NORTHSTAR floor):

Composite prose buffer (1 alloc, ~660 B holding all 5 strings).
EvidenceTrail 2-element backing array (1 alloc, escapes on the returned verdict).

Semantics

Golden-replay byte-identity preserved: module/pkg/replay/xid_correlation/canonical/golden.json passes unchanged. The unsafe.String-carved strings alias an immutable buffer; JSON serialization compares content equality, not pointer identity.

Files changed

module/pkg/patterns/xid_correlation.go — composite-buffer rewrite of buildXidCorrelationVerdict; pre-sized join maps; dead Sprintf-era renderers removed.
bench/detectors/baselines.json — refreshed BenchmarkXidCorrelationDetector entry to the new floor.
scripts/bench-registry.sh — ceiling 13 → 2; comment updated to record the ratchet and NORTHSTAR-hit.

Test plan

go test -race ./pkg/patterns/... ./pkg/replay/... ./processor/... -count=1
go test -bench=BenchmarkXidCorrelationDetector -benchmem -count=5 ./bench/detectors/ — stable at 1928 allocs/op
./scripts/bench-allocs-check.sh — every detector at-or-below ceiling
./scripts/bench-check-detectors.sh — within 10% of baseline
go vet ./..., golangci-lint run ./... — clean (pre-commit hook)
module/pkg/replay/xid_correlation/canonical/golden.json — byte-identical replay

Sibling work

perf(nccl_hang): lower allocs/event from 3.99 toward NORTHSTAR (2) #418 / PR perf(nccl_hang): ratchet allocs/event 3.99 -> 1.70 (hit NORTHSTAR, #418) #438 (nccl_hang): merged, established the strconv-builder recipe.
perf(pod_evicted): lower allocs/event from 15.27 toward NORTHSTAR (2) #434 (pod_evicted, 15.27/ev): remains open; same composite-buffer technique applies if its per-evicted-pod shape allows.

perf: xid_correlation detector allocations dropped from 12.4 to 1.88 per event (hits the ≤2 NORTHSTAR target). Verdict output is byte-identical.

Closes #417

Pack all five variable-length prose strings per verdict (EvictedPod, Headline, Remediation, xid-evidence UID, pod-evidence description) into a single composite byte buffer, then carve immutable string headers out of that buffer via unsafe.String aliasing. Combined with the 2-element EvidenceTrail backing array, the per-verdict floor is 2 allocations regardless of input scale. Detector hits NORTHSTAR. Recipe mirrors PR #438 (nccl_hang): pre-size join maps and verdicts slice; replace fmt.Sprintf with strconv.AppendInt-based builders; fold per-verdict integer formatting onto stack scratch arrays. Adds a composite-buffer + unsafe.String pass that #438 did not need (nccl_hang emits ~64 verdicts vs xid_correlation's 896 per bench pass -- per-verdict cost dominates here). Bench (bench/detectors/, count=5, M1 Max): before: 12699 allocs/op, 907 KiB/op, 12.40 allocs/event after: 1928 allocs/op, 597 KiB/op, 1.88 allocs/event delta: -85% allocs, -34% bytes; hits NORTHSTAR (<=2) Hotspots eliminated (per memprofile alloc_objects): - xidCorrelationHeadline fmt.Sprintf -> composite-buffer carve - xidCorrelationRemediation fmt.Sprintf -> composite-buffer carve - xidEvidenceUID fmt.Sprintf -> composite-buffer carve - pod-event Description fmt.Sprintf -> composite-buffer carve - displayPodName concat folded into composite buffer - indexXidsByNode map pre-sized to len(xids) - verdicts slice grow pre-sized to len(events) Verdict semantics byte-identical: golden replay (module/pkg/replay/xid_correlation/canonical/golden.json) passes unchanged. The carved strings alias an immutable backing buffer; JSON serialization compares content equality, not pointer identity. Ceiling lowered from 13 -> 2 in scripts/bench-registry.sh; baseline refreshed in bench/detectors/baselines.json. Closes #417 Signed-off-by: Tri Lam <tree@lumalabs.ai>

trilamsr · 2026-06-02T01:55:37Z

Review: xid_correlation allocs 12.4 → 1.88 (NORTHSTAR hit)

B/A/A+ Criteria

Detector correctness: Verdict output byte-identical (golden replay confirmed via no golden.json changes). Prose strings match original semantics across all 5 fields (EvictedPod, Headline, Remediation, xid-UID, pod-description).

Go perf discipline: unsafe.String aliasing is sound by construction (buffer finalized before carve, carved strings don't outlive heap-allocated backing, no further appends post-carve). Stack-allocated scratch arrays don't escape (copied into composite before next use).

Simplicity: Approach mirrors established nccl_hang pattern (PR #438). Dead functions inlined (xidCorrelationHeadline, xidCorrelationRemediation). Helpers justifiable: intToScratch (3 uses, stack-safety documentation), stringSlice (5 uses, contract clarity), writeInt (3 uses, builder encapsulation).

Unsafe.String Safety Audit

Buffer finalization (CRITICAL): Lines 158-209 fill composite buffer via appends; lines 212-216 carve unsafe.String aliases; no appends after line 209. Safe.

Length pre-computation: podDisplayLength (lines 248-259) mirrors displayPodName (pod_evicted.go:189-196) exactly across all three branches. Capacity pre-sizing correct.

Carved string lifetimes: composite is heap-allocated in buildXidCorrelationVerdict; carved strings escape in returned XidCorrelationVerdict struct. Go runtime keeps heap buffer alive as long as strings reference it (standard semantics).

Self-reference appends safe: Lines 186, 199, 206 read already-appended ranges (composite[podStart:podEnd]) within the same growing buffer. No aliasing to external buffers.

Off-by-one risk: None. lo/hi computed from len() calls within buildXidCorrelationVerdict; indices are local state.

Data race risk: None. Evaluate() is single-threaded; inputs are read-only snapshots; no goroutine spawning.

Go version: go.mod specifies 1.26.3; unsafe.String available since 1.20. ✓

Simplification Sweep: Clean

Deleted two standalone functions (xidCorrelationHeadline, xidCorrelationRemediation) inlined into composite-buffer build. Added helpers justified at ≥3 uses except writeInt (3 uses, at threshold but improves builder readability + encapsulates stack pattern). No noise comments; all doc strings serve unsafe.String contract or alloc-discipline purpose.

Benchmark Gate

Ratchet 13 → 2 allocs/event:

Measured floor: 1.88 (composite buffer + EvidenceTrail backing array)
Ceiling 2: NORTHSTAR target, justified
Slack budget: 6.4% (conservative vs nccl_hang's 17.6%)
Gate is appropriate (hard ceiling prevents creep on NORTHSTAR detector)

Verdict

A+ — Unsafe.String aliasing is sound. Golden replay byte-identical. Follows established perf recipe (composite buffer + strconv builders + pre-sized structures). Benchmark gate sensible. Code quality high.

Address reviewer A- trim items on #456: 1. displayPodName: drop optimization-impact comment (function inlined in pod_evicted; kept exported for xid_correlation use). 2. Evaluate: collapse 9-bullet allocation-changelog into one-line NORTHSTAR pointer + per-file note. 3. Consolidate fmt->strconv narrative into one top-of-Evaluate note; drop the per-site repetition across five render helpers. 4. containsFold: replace 'equivalent to' note with explicit ASCII-only safety guard (anchors + kubelet eviction messages are English). No behavior change: allocs/op = 1943 unchanged, golden replay byte-identical, bench-allocs-check still PASS. Verified finding 5: scripts/bench-registry.sh xid_correlation ceiling already 2 (post-#448), no change required. Signed-off-by: Tri Lam <tree@lumalabs.ai>

…434) (#456) ## Summary Closes #434. Profile-driven optimization of `PodEvictedDetector.Evaluate` drops per-event allocations from **15.27 → 1.90**, clearing the **NORTHSTAR (2)** floor. Bench ceiling lowered 16 → 2 in `scripts/bench-registry.sh::allocs_gate`; `bench/detectors/baselines.json` updated. Same recipe as #438 (nccl_hang) and #448 (xid_correlation). ## Before / after `go test -bench=BenchmarkPodEvictedDetector -benchmem -count=10 -benchtime=500ms ./bench/detectors/` (Apple M1 Max): | metric | before | after | delta | |---|---|---|---| | allocs/op | 15635 | 1943 | **-87.6%** | | B/op | 970020 | 473400 | -51.2% | | allocs/event | 15.27 | **1.90** | hit NORTHSTAR | Stability: 10 consecutive bench runs all measured 1943 allocs/op (CV = 0%). ## Root cause (named) Per-evicted-pod the previous detector allocated ~19 objects. Profile (`-memprofile`, `-alloc_objects`) attributed them: 1. **`fmt.Sprintf` ×5 in `buildVerdict` + helpers** — 60% of pre-fix allocs. Sprintf boxes each arg into an `any` slice + scans the format string into a fresh internal buffer; both escape. Replaced with `strconv.AppendInt` + manual byte builders. 2. **`strings.ToLower(note)` in `pressureFromNote`** — 1 alloc per evicted pod (the lowercased copy). Replaced with a zero-alloc `containsFold` ASCII-fold scan (all anchors are already lowercase). 3. **`time.Format` in `formatTimestamp`** — Sprintf-style alloc per headline. Replaced with `t.AppendFormat(buf, time.RFC3339)` writing into the shared scratch buffer. 4. **Unsized `condIdx` map + per-bucket `append` growth** — `make(map[K]V)` defaults to 8 buckets; `append(nil, r)` grows each bucket. Pre-sized to `len(recs)`. 5. **Per-verdict `make([]EvidenceRef, 1, 2)`** — 1 alloc per verdict for the trail slice header. Replaced with ONE contiguous `trailBacking := make([]EvidenceRef, 2*len(events))` sliced cap=2 per verdict. 6. **Escaping `make([]byte, dynamic-cap)` per builder call** — escape analysis can't prove the dynamic capacity fits on stack, so each builder allocated TWO objects (buf + the `string(buf)` cast). Replaced with a per-Evaluate `scratch *[]byte` reused across every builder (each resets to `(*scratch)[:0]` and appends), collapsing to ONE alloc per call (the irreducible string cast). 7. **Redundant per-condition re-render** — 820 evictions joining onto 64 conditions re-rendered the same UID + description + remediation strings on every join (2460 redundant allocs). Pre-rendered once at index time into a new `indexedNodeCond` struct. 8. **Partial-path remediation re-render** — partial-path evictions (no joined condition) re-rendered the `"On node X: <prose>"` remediation per pod. Added a `(node, pressure) → string` cache so burst evictions on the same node share one alloc. 9. **`sort.SliceStable` reflection** — both `indexNodeConds`'s per-bucket sort and the outer verdicts sort used reflection-based `sort.SliceStable` (allocates a swapper per call). Replaced with `slices.SortStableFunc` (generics, no reflection); buckets of length ≤1 skip the sort call entirely. ## Top profile entries (memprofile -alloc_objects) **Before:** ``` 60% buildVerdict (5 fmt.Sprintf calls per eviction) 22% fmt.Sprintf 6% nodeConditionDescription 4% annotateRemediationWithNode 3% time.Time.Format 4% strings.ToLower ``` **After:** ``` 42% podEventDescription (1 alloc/call — irreducible string buf cast) 36% renderHeadline (1 alloc/call — irreducible string buf cast) 7% annotateRemediationWithNode (partial-path cache misses only) 5% nodeConditionDescription (index-time pre-render, 64 calls total) 3% nodeConditionUID (index-time pre-render, 64 calls total) ``` The two per-verdict string allocations (headline + pod_event description) are now the irreducible floor: each is a distinct verdict field that downstream `json.Marshal` requires as separate strings. ## Verdict semantics — unchanged Byte-identical headlines, remediation, UIDs, descriptions, JSON envelope. Confirmed via: - `module/pkg/replay/pod_evicted/canonical/golden.json` round-trips unchanged. - All `TestPodEvicted*` unit tests pass (negative fixtures, deterministic order, partial path, out-of-window, future transition excluded, empty node message, remediation pins node name, schema conformance + drift rejection). - `TestPodEvictedVerdict_SchemaConformance` (canonical full + partial) passes. ## Gate state | gate | result | |---|---| | `scripts/bench-allocs-check.sh` (hard absolute ceiling) | PASS at new ceiling=2 | | `scripts/bench-check-detectors.sh` (10% delta soft gate) | PASS (no detector regressed) | | `go test -race ./module/pkg/patterns/ ./module/pkg/replay/ ./module/processor/...` | PASS | | `go vet ./module/pkg/patterns/` | clean | | pre-commit (golangci-lint, vet, mod verify, attribute-namespace-check) | PASS | ```release-notes pod_evicted detector now allocates 1.90 allocations per event (down from 15.27), hitting the v0.3.0 NORTHSTAR floor of <=2 allocs/event. Joins #418 (nccl_hang) and #417 (xid_correlation) — three of the six per-detector tracking issues at or below NORTHSTAR. ``` Closes #434 Refs #302 ## Test plan - [x] `go test -race -count=1 ./module/pkg/patterns/ ./module/pkg/replay/ ./module/processor/...` PASS - [x] `go vet ./module/pkg/patterns/` clean - [x] `bash scripts/bench-allocs-check.sh` PASS (pod_evicted 1.90/ev <= ceiling 2; all 6 detectors at-or-below ceiling) - [x] `bash scripts/bench-check-detectors.sh` PASS (no detector regressed >10%) - [x] `go test -bench=BenchmarkPodEvictedDetector -count=10 -benchtime=500ms` stable at 1943 allocs/op (CV=0%) - [x] Golden replay (`module/pkg/replay/pod_evicted/canonical/golden.json`) round-trips unchanged - [x] Rebased onto origin/main (resolved scripts/bench-registry.sh conflict with #448's xid_correlation ratchet — both ratchets shipped together in this branch's view) --------- Signed-off-by: Tri Lam <tree@lumalabs.ai>

## Summary `scripts/bench-registry.sh::bench_entries` covered only `BenchmarkPodEvictedDetector` and the pyspy parser benches, so the **% delta regression gate** (`scripts/bench-check.sh` via `bench-check-all.sh`) missed five of the six detectors. The absolute allocs/event ceiling (`allocs_gate`) caught a detector that drifted past its hard ceiling, but a detector that drifted +50% while staying under the ceiling would ship silently. This PR closes the gap by registering all six detectors against both gates. ## Root cause The registry was authored when `pod_evicted` was the only detector with a bench. The five later detectors (added in #417/#418/#434/#438/#448/#456) were plumbed into `allocs_gate` (absolute ceiling) but the parallel % delta gate was never extended — a historical artifact, not an architectural defect. Fix: extend `bench_entries` to cover the missing five. ## What changed - `scripts/bench-registry.sh`: added one entry covering five detectors against `./bench/detectors/`, single-regex / single `go test` invocation (~5s incremental vs ~30s if split into 5 entries). - `bench/detectors/testdata/bench-baseline.txt`: generated baseline (count=10 × 500ms, Apple M1 Max). Hardware-invariant signals (B/op + allocs/op) pin the gate; sec/op stays advisory. - Existing ceilings in `allocs_gate` unchanged. Existing baselines unchanged. ## Mutation verification Bumped `PCIeAERDetector` allocs baseline down by ~18% (524 → 430): ``` PCIeAERDetector-10 430.0 ± 0% 524.0 ± 0% +21.86% (p=0.000 n=10) REGRESSION: the following benchmarks exceeded the 10% threshold vs baseline: PCIeAERDetector-10 +21.86% (p=0.000 n=10) ``` `scripts/bench-check-all.sh` exit 1. Revert → exit 0. Gate fires on regression, stays clean on baseline. ## Coverage gap context Cross-link: #302 (allocs/event rollup), #417 (xid_correlation NORTHSTAR), #418 (nccl_hang NORTHSTAR), #434 (pod_evicted NORTHSTAR), #438/#448/#456 (sibling perf PRs). Every detector that hit NORTHSTAR under the absolute ceiling now also has a relative-regression gate. ## Test plan - [x] `make bench-check` exit 0 (covers all 6 detectors now) - [x] `make bench-allocs-check` exit 0 - [x] `make bench-detectors-check` exit 0 (soft-gate, unchanged) - [x] Mutation: lower baseline → exit 1; revert → exit 0 - [x] Lint + vet (pre-commit hook): 0 issues ```release-notes ci(bench): % delta regression gate now covers all six detectors (#302). xid_correlation, hbm_ecc, nccl_hang, thermal_throttle, and pcie_aer were previously gated only by the absolute allocs/event ceiling — they now also fail builds on >10% allocs/op or B/op drift vs the committed baseline. ``` Signed-off-by: Tri Lam <tree@lumalabs.ai>

trilamsr enabled auto-merge (squash) June 2, 2026 01:55

trilamsr merged commit 3fbcc4e into main Jun 2, 2026
23 checks passed

trilamsr deleted the perf/417-xid-correlation-allocs branch June 2, 2026 01:58

trilamsr mentioned this pull request Jun 2, 2026

perf(pod_evicted): ratchet allocs/event 15.27 -> 1.90 (hit NORTHSTAR, #434) #456

Merged

7 tasks

This was referenced Jun 2, 2026

ci(bench): cover all 6 detectors in % delta gate (#302) #483

Merged

audit(wave-2026-06-02): autonomous-wave cross-cut review #488

Closed

trilamsr mentioned this pull request Jun 2, 2026

chore: sweep dead code + close module/ lint-gate gap #486

Merged

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf(xid_correlation): ratchet allocs/event 12.4 -> 1.88 (hit NORTHSTAR, #417)#448

perf(xid_correlation): ratchet allocs/event 12.4 -> 1.88 (hit NORTHSTAR, #417)#448
trilamsr merged 1 commit into
mainfrom
perf/417-xid-correlation-allocs

trilamsr commented Jun 2, 2026

Uh oh!

trilamsr commented Jun 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

trilamsr commented Jun 2, 2026

Summary

Root cause

Approach

Before / after

Semantics

Files changed

Test plan

Sibling work

Uh oh!

trilamsr commented Jun 2, 2026

Review: xid_correlation allocs 12.4 → 1.88 (NORTHSTAR hit)

B/A/A+ Criteria

Unsafe.String Safety Audit

Simplification Sweep: Clean

Benchmark Gate

Verdict

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant